DATA-611 Final Project - Eland - Data Cleaning and Exploration¶

This notebook handles exploring, cleaning, and feature engineering for the credit card dataset.

It starts with the raw dataset and saves the result to a file named cleaned.csv

This notebook was originally developed in Azure Machine Learning Studio against the Python 3.8 - AzureML kernel on a STANDARD_E4DS_V4 compute instance

Project Goal: Predict credit card default¶

This is a binary classification problem: predict whether a credit card account will default in the future

Data Set¶

The training data set includes a binary target variable, default payment (Yes = 1, No = 0), and the following 23 feature variables:

  • X1: Amount of the given credit
  • X2: Gender (1 = male; 2 = female)
  • X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
  • X4: Marital status (1 = married; 2 = single; 3 = others)
  • X5: Age
  • X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
  • X12-X17: Amount of bill statement. X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
  • X18-X23: Amount of previous payment. X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.

data file: https://raw.githubusercontent.com/franklin-univ-data-science/data/master/credit_default.csv

Dependencies¶

In [2]:
%pip install numpy
%pip install pandas
Requirement already satisfied: numpy in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (1.21.6)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: pandas in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (1.1.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.15.4 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from pandas) (1.21.6)
Requirement already satisfied: pytz>=2017.2 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from pandas) (2022.5)
Requirement already satisfied: six>=1.5 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [3]:
# Set up Plotly express for visualization
%pip install plotly

import plotly.express as px

px.defaults.template = 'plotly_white'
px.defaults.color_continuous_scale = px.colors.sequential.Plasma
px.defaults.color_discrete_sequence = px.colors.qualitative.Vivid
Requirement already satisfied: plotly in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (5.15.0)
Requirement already satisfied: packaging in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from plotly) (23.0)
Requirement already satisfied: tenacity>=6.2.0 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from plotly) (8.2.2)
Note: you may need to restart the kernel to use updated packages.
In [4]:
%pip install seaborn

import seaborn as sns
import numpy as np

def plot_correlation(corr, x_size=25, y_size=18):
    """ Displays a diagonalized heatmap from a correlation matrix using seaborn. """

    # This is a nice diverging palette that shows positive and negative correlations easily
    cmap = sns.color_palette("seismic", as_cmap=True)

    # Create a mask to cut off the top-right of the triangle
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    mask[np.diag_indices_from(mask)] = False # Show the 1.00 self-correlation diagonal line. It looks a little empty without it
    
    # Create the heatmap
    fig = sns.heatmap(corr, mask=mask, cmap=cmap, center=0, linewidths=.5, annot=True, fmt='.2f', vmin=-1, vmax=1, xticklabels=True, yticklabels=True)   
    fig.set_xticklabels(fig.get_xticklabels(), rotation=45, horizontalalignment='right')
    fig.set_yticklabels(fig.get_yticklabels(), rotation=0, horizontalalignment='right')
    fig.axes.set_title('Feature Correlation')
    fig.figure.set_size_inches(x_size, y_size)
Requirement already satisfied: seaborn in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (0.12.2)
Requirement already satisfied: pandas>=0.25 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from seaborn) (1.1.5)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from seaborn) (1.21.6)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from seaborn) (3.2.1)
Requirement already satisfied: pytz>=2017.2 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from pandas>=0.25->seaborn) (2022.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from pandas>=0.25->seaborn) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.1.0)
Requirement already satisfied: cycler>=0.10 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: six>=1.5 in /anaconda/envs/azureml_py38/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas>=0.25->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.

Data Loading¶

In [5]:
import pandas as pd

# Load credit history data
df = pd.read_csv('https://raw.githubusercontent.com/franklin-univ-data-science/data/master/credit_default.csv')
df.head()
Out[5]:
ID X1 X2 X3 X4 X5 X6 X7 X8 X9 ... X15 X16 X17 X18 X19 X20 X21 X22 X23 Y
0 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns

In [6]:
# The ID column has no predictive value, but it makes a natural row index
df.set_index('ID', inplace=True)
df.head()
Out[6]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ... X15 X16 X17 X18 X19 X20 X21 X22 X23 Y
ID
1 20000 2 2 1 24 2 2 -1 -1 -2 ... 0 0 0 0 689 0 0 0 0 1
2 120000 2 2 2 26 -1 2 0 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
3 90000 2 2 2 34 0 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
4 50000 2 2 1 37 0 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
5 50000 1 2 1 57 -1 0 -1 0 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 24 columns

In [7]:
# Get a sense of its scale
df.shape
Out[7]:
(30000, 24)
In [8]:
# Rename columns so they make more sense to read and chart
df.rename(columns={'X1' : 'Credit Amount', 
                   'X2' : 'Gender', 
                   'X3' : 'Education', 
                   'X4' : 'Marital Status', 
                   'X5' : 'Age in Years', 
                   'X6' : 'Repay Delay Sep',
                   'X7' : 'Repay Delay Aug',
                   'X8' : 'Repay Delay Jul',
                   'X9' : 'Repay Delay Jun',
                   'X10': 'Repay Delay May',
                   'X11': 'Repay Delay Apr',
                   'X12': 'Bill Sep',
                   'X13': 'Bill Aug',
                   'X14': 'Bill Jul',
                   'X15': 'Bill Jun',
                   'X16': 'Bill May',
                   'X17': 'Bill Apr',
                   'X18': 'Prior Pay Sep',
                   'X19': 'Prior Pay Aug',
                   'X20': 'Prior Pay Jul',
                   'X21': 'Prior Pay Jun',
                   'X22': 'Prior Pay May',
                   'X23': 'Prior Pay Apr',  
                   'Y'  : 'Defaulted'
                   }, 
          inplace=True)
df.head()
Out[8]:
Credit Amount Gender Education Marital Status Age in Years Repay Delay Sep Repay Delay Aug Repay Delay Jul Repay Delay Jun Repay Delay May ... Bill Jun Bill May Bill Apr Prior Pay Sep Prior Pay Aug Prior Pay Jul Prior Pay Jun Prior Pay May Prior Pay Apr Defaulted
ID
1 20000 2 2 1 24 2 2 -1 -1 -2 ... 0 0 0 0 689 0 0 0 0 1
2 120000 2 2 2 26 -1 2 0 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
3 90000 2 2 2 34 0 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
4 50000 2 2 1 37 0 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
5 50000 1 2 1 57 -1 0 -1 0 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 24 columns

In [9]:
# Read the schema
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Credit Amount    30000 non-null  int64
 1   Gender           30000 non-null  int64
 2   Education        30000 non-null  int64
 3   Marital Status   30000 non-null  int64
 4   Age in Years     30000 non-null  int64
 5   Repay Delay Sep  30000 non-null  int64
 6   Repay Delay Aug  30000 non-null  int64
 7   Repay Delay Jul  30000 non-null  int64
 8   Repay Delay Jun  30000 non-null  int64
 9   Repay Delay May  30000 non-null  int64
 10  Repay Delay Apr  30000 non-null  int64
 11  Bill Sep         30000 non-null  int64
 12  Bill Aug         30000 non-null  int64
 13  Bill Jul         30000 non-null  int64
 14  Bill Jun         30000 non-null  int64
 15  Bill May         30000 non-null  int64
 16  Bill Apr         30000 non-null  int64
 17  Prior Pay Sep    30000 non-null  int64
 18  Prior Pay Aug    30000 non-null  int64
 19  Prior Pay Jul    30000 non-null  int64
 20  Prior Pay Jun    30000 non-null  int64
 21  Prior Pay May    30000 non-null  int64
 22  Prior Pay Apr    30000 non-null  int64
 23  Defaulted        30000 non-null  int64
dtypes: int64(24)
memory usage: 5.7 MB

It looks like there are no null values in any column
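The info() reading can be confirmed directly with isnull(). A minimal sketch on a toy frame (hypothetical values; in the notebook the check would run against df itself):

```python
import pandas as pd

# Toy frame standing in for the real df (hypothetical values)
toy = pd.DataFrame({'Credit Amount': [20000, 120000, 90000],
                    'Defaulted': [1, 1, 0]})

# Nulls per column; an all-zero result confirms what info() showed
null_counts = toy.isnull().sum()
total_nulls = int(null_counts.sum())
print(total_nulls)  # 0 means no missing values anywhere
```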

In [10]:
# Before we do anything more extreme, let's get descriptive statistics
df.describe()
Out[10]:
Credit Amount Gender Education Marital Status Age in Years Repay Delay Sep Repay Delay Aug Repay Delay Jul Repay Delay Jun Repay Delay May ... Bill Jun Bill May Bill Apr Prior Pay Sep Prior Pay Aug Prior Pay Jul Prior Pay Jun Prior Pay May Prior Pay Apr Defaulted
count 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 ... 30000.000000 30000.000000 30000.000000 30000.000000 3.000000e+04 30000.00000 30000.000000 30000.000000 30000.000000 30000.000000
mean 167484.322667 1.603733 1.853133 1.551867 35.485500 -0.016700 -0.133767 -0.166200 -0.220667 -0.266200 ... 43262.948967 40311.400967 38871.760400 5663.580500 5.921163e+03 5225.68150 4826.076867 4799.387633 5215.502567 0.221200
std 129747.661567 0.489129 0.790349 0.521970 9.217904 1.123802 1.197186 1.196868 1.169139 1.133187 ... 64332.856134 60797.155770 59554.107537 16563.280354 2.304087e+04 17606.96147 15666.159744 15278.305679 17777.465775 0.415062
min 10000.000000 1.000000 0.000000 0.000000 21.000000 -2.000000 -2.000000 -2.000000 -2.000000 -2.000000 ... -170000.000000 -81334.000000 -339603.000000 0.000000 0.000000e+00 0.00000 0.000000 0.000000 0.000000 0.000000
25% 50000.000000 1.000000 1.000000 1.000000 28.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 ... 2326.750000 1763.000000 1256.000000 1000.000000 8.330000e+02 390.00000 296.000000 252.500000 117.750000 0.000000
50% 140000.000000 2.000000 2.000000 2.000000 34.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 19052.000000 18104.500000 17071.000000 2100.000000 2.009000e+03 1800.00000 1500.000000 1500.000000 1500.000000 0.000000
75% 240000.000000 2.000000 2.000000 2.000000 41.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 54506.000000 50190.500000 49198.250000 5006.000000 5.000000e+03 4505.00000 4013.250000 4031.500000 4000.000000 0.000000
max 1000000.000000 2.000000 6.000000 3.000000 79.000000 8.000000 8.000000 8.000000 8.000000 8.000000 ... 891586.000000 927171.000000 961664.000000 873552.000000 1.684259e+06 896040.00000 621000.000000 426529.000000 528666.000000 1.000000

8 rows × 24 columns

Data Wrangling¶

Let's make these columns a bit more relevant to what we want to do

Defaulted (Label / Target Column)¶

In [11]:
# Map defaulted values of 1 to True and 0 to False
df['Defaulted'] = df['Defaulted'].eq(1)
df['Defaulted'].value_counts()
Out[11]:
False    23364
True      6636
Name: Defaulted, dtype: int64
In [12]:
# And again, but normalized so we see percentages
df['Defaulted'].value_counts(normalize=True)
Out[12]:
False    0.7788
True     0.2212
Name: Defaulted, dtype: float64
In [13]:
# Plot a histogram of Defaulted
px.histogram(df, x='Defaulted', color='Defaulted', title='Target Class Distribution')

Congratulations, we have a minor class imbalance: only about 22% of rows defaulted. It's not critical, but it's something to watch.

Gender¶

In [14]:
# See what the gender data looks like
df['Gender'].value_counts()
Out[14]:
2    18112
1    11888
Name: Gender, dtype: int64

Slight imbalance there. We know 1 is male and, per the data dictionary, 2 is female. Let's convert the column to a boolean with a clearer name

In [15]:
df['Gender'] = df['Gender'].eq(1)
df.rename(columns={'Gender': 'Is Male'}, inplace=True)
df['Is Male'].value_counts()
Out[15]:
False    18112
True     11888
Name: Is Male, dtype: int64
In [16]:
# Plot a histogram of Is Male
px.histogram(df, x='Is Male', color='Defaulted', title='Gender Distribution')

Education¶

In [17]:
# Let's look at education next. The requirements say "(1 = graduate school; 2 = university; 3 = high school; 4 = others)"
df['Education'].value_counts()
Out[17]:
2    14030
1    10585
3     4917
5      280
4      123
6       51
0       14
Name: Education, dtype: int64

More undocumented values! We have 5, 6, and 0, which the data dictionary doesn't mention. Let's roll them into the 'others' bucket

In [18]:
# Map education values of 1 to "Graduate School", 2 to "University", 3 to "High School", and everything else to "Other"
def map_education(education):
    if education == 1:
        return 'Graduate School'
    elif education == 2:
        return 'University'
    elif education == 3:
        return 'High School'
    else:
        return 'Other Education'
    
df['Education'] = df['Education'].apply(map_education)
df['Education'].value_counts()
Out[18]:
University         14030
Graduate School    10585
High School         4917
Other Education      468
Name: Education, dtype: int64
In [19]:
# Plot a histogram of Education
px.histogram(df, x='Education', color='Defaulted', title='Education Distribution')

There are clear differences in default percentage across the educational buckets, and there may not be enough rows in 'Other Education' or 'High School' to be reliable on their own. Let's introduce dummy columns for all of these values.
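Those percentage differences can be quantified with a groupby on the default rate. A minimal sketch on a hypothetical mini-sample (in the notebook you would group df itself):

```python
import pandas as pd

# Hypothetical mini-sample in the same shape as df at this point
sample = pd.DataFrame({
    'Education': ['University', 'University', 'Graduate School',
                  'High School', 'High School', 'Other Education'],
    'Defaulted': [True, False, False, True, True, False],
})

# Mean of a boolean column = fraction of True, i.e. the default rate per bucket
rates = sample.groupby('Education')['Defaulted'].mean()
print(rates)
```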

In [20]:
df = pd.get_dummies(df, columns=['Education'], prefix='', prefix_sep='')
df.head()
Out[20]:
Credit Amount Is Male Marital Status Age in Years Repay Delay Sep Repay Delay Aug Repay Delay Jul Repay Delay Jun Repay Delay May Repay Delay Apr ... Prior Pay Aug Prior Pay Jul Prior Pay Jun Prior Pay May Prior Pay Apr Defaulted Graduate School High School Other Education University
ID
1 20000 False 1 24 2 2 -1 -1 -2 -2 ... 689 0 0 0 0 True 0 0 0 1
2 120000 False 2 26 -1 2 0 0 0 2 ... 1000 1000 1000 0 2000 True 0 0 0 1
3 90000 False 2 34 0 0 0 0 0 0 ... 1500 1000 1000 1000 5000 False 0 0 0 1
4 50000 False 1 37 0 0 0 0 0 0 ... 2019 1200 1100 1069 1000 False 0 0 0 1
5 50000 True 1 57 -1 0 -1 0 0 0 ... 36681 10000 9000 689 679 False 0 0 0 1

5 rows × 27 columns

Marital Status¶

In [21]:
# The requirements say "(1 = married; 2 = single; 3 = others)". Let's see if that's what we have
df['Marital Status'].value_counts()
Out[21]:
2    15964
1    13659
3      323
0       54
Name: Marital Status, dtype: int64
In [22]:
# Map marital status values of 1 to "Married", 2 to "Single", and everything else to "Other"
def map_marital_status(status):
    if status == 1:
        return 'Married'
    elif status == 2:
        return 'Single'
    else:
        return 'Other Marital Status'
    
df['Marital Status'] = df['Marital Status'].apply(map_marital_status)
df['Marital Status'].value_counts()
Out[22]:
Single                  15964
Married                 13659
Other Marital Status      377
Name: Marital Status, dtype: int64
In [23]:
# Plot a histogram of Marital Status
px.histogram(df, x='Marital Status', color='Defaulted', title='Marital Status Distribution')

Okay, so 'Other Marital Status' is nearly non-existent. That means married and single can likely be reduced to a single boolean Is Married column. Separate correlation analysis supports this, so let's add the boolean and drop the Marital Status column.
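One quick way to sanity-check that reduction is a row-normalized crosstab of default rate per marital-status bucket. A minimal sketch on a hypothetical mini-sample (the notebook would use df):

```python
import pandas as pd

# Hypothetical mini-sample; values are illustrative only
sample = pd.DataFrame({
    'Marital Status': ['Married', 'Single', 'Single',
                       'Married', 'Other Marital Status'],
    'Defaulted': [True, False, True, False, False],
})

# normalize='index' turns each row into proportions, giving default rate per bucket
ct = pd.crosstab(sample['Marital Status'], sample['Defaulted'], normalize='index')
print(ct)
```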

In [24]:
df['Is Married'] = df['Marital Status'].eq('Married')
df.drop(columns=['Marital Status'], inplace=True)
df['Is Married'].value_counts()
Out[24]:
False    16341
True     13659
Name: Is Married, dtype: int64

Repayment Delay¶

In [25]:
df['Repay Delay Sep'].value_counts()
Out[25]:
 0    14737
-1     5686
 1     3688
-2     2759
 2     2667
 3      322
 4       76
 5       26
 8       19
 6       11
 7        9
Name: Repay Delay Sep, dtype: int64

The repayment scale is odd: -1 indicates paying on time, while 1+ indicates the number of months of delay. It'll be easier to work with if we move those -1's to 0's.

Also note that there's a -2 value that the data dictionary doesn't mention. Let's assume it means the customer paid early and treat it as 0 months, so our scale starts at 0.

Reasoning: we don't care how early they paid; we just care *if* they paid and didn't later default.

In [26]:
# Clip all repayment columns to a minimum of 0, collapsing the negative "paid duly / paid early" codes
for column in df.columns:
    if column.startswith('Repay Delay'):
        df[column] = df[column].clip(lower=0)
        fig = px.histogram(df, x=column, color='Defaulted', title=column + ' Distribution')
        fig.show()

Feature Engineering¶

Some of our columns suggest additional features we can engineer from them

Total Columns¶

It's probably helpful to have total columns indicating the combined amount of bill / prior pay / repay delay across all time periods

In [27]:
df['Bill Total'] = df['Bill Sep'] + df['Bill Aug'] + df['Bill Jul'] + df['Bill Jun'] + df['Bill May'] + df['Bill Apr']
df['Prior Pay Total'] = df['Prior Pay Sep'] + df['Prior Pay Aug'] + df['Prior Pay Jul'] + df['Prior Pay Jun'] + df['Prior Pay May'] + df['Prior Pay Apr']
df['Repay Delay Total'] = df['Repay Delay Sep'] + df['Repay Delay Aug'] + df['Repay Delay Jul'] + df['Repay Delay Jun'] + df['Repay Delay May'] + df['Repay Delay Apr']
df.head()
Out[27]:
Credit Amount Is Male Age in Years Repay Delay Sep Repay Delay Aug Repay Delay Jul Repay Delay Jun Repay Delay May Repay Delay Apr Bill Sep ... Prior Pay Apr Defaulted Graduate School High School Other Education University Is Married Bill Total Prior Pay Total Repay Delay Total
ID
1 20000 False 24 2 2 0 0 0 0 3913 ... 0 True 0 0 0 1 True 7704 689 4
2 120000 False 26 0 2 0 0 0 2 2682 ... 2000 True 0 0 0 1 False 17077 5000 4
3 90000 False 34 0 0 0 0 0 0 29239 ... 5000 False 0 0 0 1 False 101653 11018 0
4 50000 False 37 0 0 0 0 0 0 46990 ... 1000 False 0 0 0 1 True 231334 8388 0
5 50000 True 57 0 0 0 0 0 0 8617 ... 679 False 0 0 0 1 True 109339 59049 0

5 rows × 30 columns

In [28]:
px.histogram(df, x='Bill Total', color='Defaulted', title='Bill Total Distribution')
In [29]:
px.histogram(df, x='Prior Pay Total', color='Defaulted', title='Prior Pay Total Distribution')
In [30]:
px.histogram(df, x='Repay Delay Total', color='Defaulted', title='Repay Delay Total Distribution')

Interestingly, rows with a total repay delay of 1 are much less likely to default
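That observation can be checked numerically rather than read off the histogram. A minimal sketch on a hypothetical mini-sample (the real notebook would group df itself):

```python
import pandas as pd

# Hypothetical mini-sample; delays and outcomes are illustrative only
sample = pd.DataFrame({
    'Repay Delay Total': [0, 0, 1, 1, 4, 4],
    'Defaulted': [False, True, False, False, True, True],
})

# Default rate and row count for each total-delay value
stats = sample.groupby('Repay Delay Total')['Defaulted'].agg(['mean', 'count'])
print(stats)
```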

Payment Delta Columns¶

We probably don't care about the prior payment amount or the bill amount on their own; we just care about the balance (bill - payment)

In [31]:
df['Balance Sep'] = df['Bill Sep'] - df['Prior Pay Sep']
df['Balance Aug'] = df['Bill Aug'] - df['Prior Pay Aug']
df['Balance Jul'] = df['Bill Jul'] - df['Prior Pay Jul']
df['Balance Jun'] = df['Bill Jun'] - df['Prior Pay Jun']
df['Balance May'] = df['Bill May'] - df['Prior Pay May']
df['Balance Apr'] = df['Bill Apr'] - df['Prior Pay Apr']
df['Balance Total'] = df['Bill Total'] - df['Prior Pay Total']
df.head()
Out[31]:
Credit Amount Is Male Age in Years Repay Delay Sep Repay Delay Aug Repay Delay Jul Repay Delay Jun Repay Delay May Repay Delay Apr Bill Sep ... Bill Total Prior Pay Total Repay Delay Total Balance Sep Balance Aug Balance Jul Balance Jun Balance May Balance Apr Balance Total
ID
1 20000 False 24 2 2 0 0 0 0 3913 ... 7704 689 4 3913 2413 689 0 0 0 7015
2 120000 False 26 0 2 0 0 0 2 2682 ... 17077 5000 4 2682 725 1682 2272 3455 1261 12077
3 90000 False 34 0 0 0 0 0 0 29239 ... 101653 11018 0 27721 12527 12559 13331 13948 10549 90635
4 50000 False 37 0 0 0 0 0 0 46990 ... 231334 8388 0 44990 46214 48091 27214 27890 28547 222946
5 50000 True 57 0 0 0 0 0 0 8617 ... 109339 59049 0 6617 -31011 25835 11940 18457 18452 50290

5 rows × 37 columns

In [32]:
# Show histograms of all balance columns
for column in df.columns:
    if column.startswith('Balance'):
        fig = px.histogram(df, x=column, color='Defaulted', title=column + ' Distribution')
        fig.show()

Additional Visualization¶

In [33]:
fig = px.scatter_3d(df, x='Balance Total', z='Prior Pay Total', y='Bill Total', color='Defaulted', title='Balance Relationships')

# Make the figure tall so we can see it
fig.update_layout(height=600)
In [34]:
fig = px.scatter_3d(df, x='Balance Total', z='Prior Pay Total', y='Credit Amount', color='Defaulted', title='Balance Relationships with Credit Amount')

# Make the figure tall so we can see it
fig.update_layout(height=600)
In [35]:
fig = px.scatter_3d(df, x='Credit Amount', z='Prior Pay Total', y='Repay Delay Total', color='Defaulted', title='Balance Relationships with Repay Delay Total')

# Make the figure tall so we can see it
fig.update_layout(height=600)
In [36]:
fig = px.scatter_3d(df, x='Credit Amount', z='Prior Pay Total', y='Repay Delay Total', color='Balance Total', title='Balance Relationships with Credit Amount')

# Make the figure tall so we can see it
fig.update_layout(height=600)
In [37]:
px.histogram(df, x='Credit Amount', title='Credit Amount Distribution', color='Defaulted', marginal='box')
In [38]:
px.histogram(df, x='Prior Pay Total', title='Prior Pay Distribution', color='Defaulted', marginal='box')
In [39]:
px.histogram(df, x='Repay Delay Total', title='Repay Delay Distribution', color='Defaulted', marginal='box')

Based on this, we should probably ignore rows with a Prior Pay Total above $100k (which also covers the extreme values past $2.5 million) and a Repay Delay Total above 5. This makes these extreme points less likely to skew the model training process and clarifies the effective range of our model.

Outlier Removal¶

We clearly have some outliers. These will impact our model training and correlation analysis, so let's get rid of them now.

In [40]:
# Filter out the identified outliers
df = df[df['Prior Pay Total'] <= 2500000]
df = df[df['Prior Pay Total'] <= 100000]  # this stricter cap makes the line above redundant
df = df[df['Repay Delay Total'] <= 5]

df.describe()
Out[40]:
Credit Amount Age in Years Repay Delay Sep Repay Delay Aug Repay Delay Jul Repay Delay Jun Repay Delay May Repay Delay Apr Bill Sep Bill Aug ... Bill Total Prior Pay Total Repay Delay Total Balance Sep Balance Aug Balance Jul Balance Jun Balance May Balance Apr Balance Total
count 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 ... 2.473800e+04 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 24738.000000 2.473800e+04
mean 166257.970733 35.436211 0.202967 0.129477 0.103646 0.066942 0.049721 0.074177 48214.109588 45427.235185 ... 2.426526e+05 21360.925055 0.626930 44169.446398 41483.419314 38818.686191 34859.830827 31757.426186 30202.898294 2.212917e+05
std 124821.754166 9.210237 0.510864 0.494391 0.445713 0.360742 0.312572 0.381638 69199.590220 66268.583243 ... 3.509089e+05 21019.749727 1.223993 67712.457447 64904.621849 61814.285509 56851.045257 53138.152046 51886.413080 3.415802e+05
min 10000.000000 21.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -11545.000000 -69777.000000 ... -2.595230e+05 0.000000 0.000000 -89921.000000 -139777.000000 -90676.000000 -90671.000000 -89273.000000 -122246.000000 -2.599780e+05
25% 60000.000000 28.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2551.000000 2000.000000 ... 2.072225e+04 6505.000000 0.000000 384.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.401250e+03
50% 140000.000000 34.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 19841.500000 18790.000000 ... 1.048005e+05 13837.500000 0.000000 17205.500000 16343.500000 15540.000000 13954.000000 11531.000000 9816.000000 8.677200e+04
75% 230000.000000 41.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 63569.250000 59516.750000 ... 3.000275e+05 29904.750000 1.000000 60135.500000 56063.250000 51009.250000 46558.750000 42179.250000 39727.250000 2.805608e+05
max 800000.000000 79.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 746814.000000 605943.000000 ... 2.862678e+06 100000.000000 5.000000 726314.000000 591943.000000 663643.000000 470005.000000 535865.000000 679638.000000 2.786678e+06

8 rows × 34 columns

Correlation Analysis¶

In [41]:
plot_correlation(df.corr())

This is pretty busy, but the bill columns stand out as not being meaningfully correlated with Defaulted, so let's drop those columns

In [42]:
df.drop(columns=['Bill Sep', 'Bill Aug', 'Bill Jul', 'Bill Jun', 'Bill May', 'Bill Apr', 'Bill Total'], inplace=True)

plot_correlation(df.corr())

Looking more closely, the balance columns don't correlate strongly with Defaulted and tend to be correlated with the prior payment columns and Credit Amount, so let's drop those columns too

In [43]:
df.drop(columns=['Balance Sep', 'Balance Aug', 'Balance Jul', 'Balance Jun', 'Balance May', 'Balance Apr', 'Balance Total'], inplace=True)
In [44]:
plot_correlation(df.corr(), x_size=18, y_size=12)
In [45]:
# Other Education carries almost no signal, so let's drop it
df.drop(columns=['Other Education'], inplace=True)

# Prior Pay June - April aren't all that relevant, so let's drop them too. We'll keep September through July as well as the total
df.drop(columns=['Prior Pay Jun', 'Prior Pay May', 'Prior Pay Apr'], inplace=True)

plot_correlation(df.corr(), x_size=16, y_size=10)

Is Male, Is Married, and Age in Years aren't very predictive, but they could be sources of model bias we'd like to detect. Let's keep them in the dataset for bias detection, but we shouldn't feed them to our model
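One way to carry this decision into the modeling notebook is to keep the bias columns in the saved CSV but split them out before training. A minimal, hypothetical sketch (df_small and bias_cols are illustrative names, not from this notebook):

```python
import pandas as pd

# Hypothetical miniature of the cleaned frame
df_small = pd.DataFrame({
    'Credit Amount': [20000, 120000],
    'Is Male': [False, True],
    'Is Married': [True, False],
    'Age in Years': [24, 26],
    'Defaulted': [True, False],
})

# Columns retained only for post-hoc bias detection, never given to the model
bias_cols = ['Is Male', 'Is Married', 'Age in Years']
X = df_small.drop(columns=bias_cols + ['Defaulted'])  # model features
y = df_small['Defaulted']                             # target
print(list(X.columns))
```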

In [46]:
# Let's examine our remaining educational features and their correlations to defaulted

corr = df[['Defaulted', 'Credit Amount', 'High School', 'University', 'Graduate School']].corr()

plot_correlation(corr, x_size=4, y_size=3)

It seems education may largely be captured by Credit Amount. Let's represent education with just the Graduate School flag, since that boolean still carries education's relationship with credit and with default

In [47]:
df.drop(columns=['High School', 'University'], inplace=True)

plot_correlation(df.corr(), x_size=12, y_size=8)
In [48]:
# Drop the Repay Delay columns for June - April to simplify things; they're weakly correlated with Defaulted and captured by the total
df.drop(columns=['Repay Delay Jun', 'Repay Delay May', 'Repay Delay Apr'], inplace=True)

plot_correlation(df.corr(), x_size=12, y_size=8)

This may still be more than we need, but we'll say these are our final features. Let's see the correlations without the bias detection columns:

In [49]:
plot_correlation(df.drop(columns=['Is Male', 'Age in Years', 'Is Married']).corr(), x_size=9, y_size=6) # Doesn't actually remove the columns from df

Export Cleaned Dataset¶

Now that we've dropped the non-relevant bits, let's save the data to a CSV file called cleaned.csv

In [53]:
# Save the file to disk at cleaned.csv, ensuring the index is also persisted
df.to_csv('cleaned.csv', index=True)

Work continues in TrainTestSplit.ipynb